We load the speeches made by legislators in the UK House of Commons during the 2014 calendar year, which we will work with throughout this seminar on structural topic modeling, in order to understand how parliamentary speech varies with the gender of the MP.
# Load the packages used throughout (package names assume a recent quanteda,
# where textplot_wordcloud() lives in quanteda.textplots) and the data
library(quanteda); library(quanteda.textplots); library(stm)
library(dplyr); library(ggplot2)
library(cld2)  # or cld3; both provide detect_language()
PATH <- '~/Desktop/TERM 1/SMI/seminar5/data'
load(file.path(PATH, "hoc_speeches.RData"))
Now, we can compute some summary statistics to get a sense of the general patterns in our data.
summary(speeches)
date name party gender speech ntoken
Min. :2014-01-06 Length:7590 Length:7590 female:1670 Length:7590 Min. : 1.0
1st Qu.:2014-03-12 Class :character Class :character male :5920 Class :character 1st Qu.: 50.0
Median :2014-06-23 Mode :character Mode :character Mode :character Median : 80.0
Mean :2014-06-24 Mean : 171.6
3rd Qu.:2014-10-21 3rd Qu.: 132.0
Max. :2014-12-18 Max. :6342.0
We can see that there is a significant gender class imbalance: female speeches make up just 22% and male speeches 78% of the 7,590 speeches in the dataset.
Then, we can check for duplicated speeches and see that there are indeed some: 7,139 unique speeches versus the original 7,590.
length(unique(speeches$speech))
[1] 7139
However, since the same speech may have been delivered by different members (and therefore possibly by both men and women), we will not drop the duplicates, so as not to lose information by further imbalancing the dataset.
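This reasoning can be tested directly: check whether any duplicated speech text is associated with more than one gender. A minimal sketch on invented toy data (only the column names mirror the real speeches data):

```r
library(dplyr)

# Toy stand-in for the speeches data frame; values are invented
toy <- data.frame(
  speech = c("I beg to move.", "I beg to move.", "Hear, hear!", "Order!"),
  gender = c("male", "female", "male", "male")
)

# Duplicated speech texts that span more than one gender
cross_gender_dups <- toy %>%
  group_by(speech) %>%
  summarise(n = n(), n_genders = n_distinct(gender)) %>%
  filter(n > 1, n_genders > 1)

cross_gender_dups
# → one row: "I beg to move." with n = 2, n_genders = 2
```

If this returns any rows on the real data, dropping duplicates would indeed discard gender information.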
Finally, we can check the number of speeches performed by each party in the UK House of Commons:
speeches %>%
group_by(party) %>%
summarise(observations = length(party))
where we can see that 56% were delivered by Conservative MPs, 34% by Labour MPs and 10% by Liberal Democrat MPs, indicating which party has more representatives in the House of Commons (and hence more decision-making power).
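These percentages can also be computed directly rather than by hand; a sketch on a toy vector standing in for speeches$party (the toy values are invented for illustration):

```r
# Toy stand-in for speeches$party; the real vector has 7,590 entries
party <- c("Conservative", "Conservative", "Labour", "Conservative",
           "Labour", "Liberal Democrat", "Conservative", "Labour")

# prop.table() turns the counts into shares of the total
shares <- round(100 * prop.table(table(party)), 1)
shares
# Conservative: 50, Labour: 37.5, Liberal Democrat: 12.5
```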
Next, we will use the functions in the quanteda package to turn this data frame into a corpus object. The remaining columns are stored as document variables, and we additionally save the speeches data via the meta parameter of corpus() so we can access it later. We can also check the detected languages of the speeches, which are, as expected, all English.
speeches_corpus <- corpus(speeches, text_field = 'speech', meta = speeches)
languages <- detect_language(speeches_corpus)
table(languages)
languages
en
7590
To study the corpus more thoroughly, we can plot the length of each document using the ntoken() function. Here, on the raw, uncleaned data, tokens include words, punctuation characters, numbers, and so on. Let's plot the distribution and analyze it further.
ntokens_speeches <- ntoken(speeches_corpus)
data.frame(ntokens_speeches) %>% ggplot(aes(ntokens_speeches)) + geom_histogram() + xlab('Number of tokens')
First, we can see that many speeches have fewer than roughly 250 tokens (and a good number appear to have near-zero length). Furthermore, while the summary above showed a maximum of 6,342 tokens, very few speeches are anywhere near that long. All things considered, we will therefore keep only the speeches with between 50 and 1,000 tokens.
docs_length50_1000 <- names(ntokens_speeches[(ntokens_speeches>=50) & (ntokens_speeches<=1000)])
speeches_corpus <- speeches_corpus[names(speeches_corpus) %in% docs_length50_1000]
We can now compute summary statistics on the filtered data and inspect the main features of each speech. We show only the first 10 entries so that the output is not unnecessarily long.
summary(speeches_corpus, 10)
Corpus consisting of 6057 documents, showing 10 documents:
Text Types Tokens Sentences date name party gender ntoken
text1 78 109 5 2014-03-13 Lyn Brown Labour female 109
text2 66 97 8 2014-09-09 Cheryl Gillan Conservative female 97
text3 41 59 3 2014-04-08 James Morris Conservative male 59
text4 52 68 3 2014-12-08 Rehman Chishti Conservative male 68
text6 277 630 25 2014-01-09 Anne McIntosh Conservative female 642
text7 48 69 4 2014-10-13 Debbie Abrahams Labour female 71
text8 38 54 2 2014-09-08 Brandon Lewis Conservative male 54
text9 81 113 4 2014-01-06 Andy Sawford Labour male 113
text10 268 626 21 2014-03-05 Jeremy Lefroy Conservative male 634
text11 59 72 5 2014-01-16 Andrew Lansley Conservative male 72
Now, working with the filtered data, we will first turn this corpus into a tokens object and then into a document-feature matrix, applying the following preprocessing steps: removing punctuation, symbols and numbers; removing English stopwords plus a custom list of frequent but uninformative parliamentary words; stemming; forming unigrams and bigrams; lowercasing; and trimming features that occur fewer than 5 times or in fewer than 0.25% of documents. We first build the extended stopword list, then run the pipeline and display a small portion of the resulting document-feature matrix, which shows how many times each displayed word appears in the first few texts:
eng_sw <- stopwords("en")
our_stopwords <- c(eng_sw, 'hon', 'can','s','make','need','one','want','say','take','sure','however',
'may','much','must','now','go','use','two','get','said','made','also','look','put','see','give')
our_stopwords
[1] "i" "me" "my" "myself" "we" "our" "ours" "ourselves" "you"
[10] "your" "yours" "yourself" "yourselves" "he" "him" "his" "himself" "she"
[19] "her" "hers" "herself" "it" "its" "itself" "they" "them" "their"
[28] "theirs" "themselves" "what" "which" "who" "whom" "this" "that" "these"
[37] "those" "am" "is" "are" "was" "were" "be" "been" "being"
[46] "have" "has" "had" "having" "do" "does" "did" "doing" "would"
[55] "should" "could" "ought" "i'm" "you're" "he's" "she's" "it's" "we're"
[64] "they're" "i've" "you've" "we've" "they've" "i'd" "you'd" "he'd" "she'd"
[73] "we'd" "they'd" "i'll" "you'll" "he'll" "she'll" "we'll" "they'll" "isn't"
[82] "aren't" "wasn't" "weren't" "hasn't" "haven't" "hadn't" "doesn't" "don't" "didn't"
[91] "won't" "wouldn't" "shan't" "shouldn't" "can't" "cannot" "couldn't" "mustn't" "let's"
[100] "that's" "who's" "what's" "here's" "there's" "when's" "where's" "why's" "how's"
[109] "a" "an" "the" "and" "but" "if" "or" "because" "as"
[118] "until" "while" "of" "at" "by" "for" "with" "about" "against"
[127] "between" "into" "through" "during" "before" "after" "above" "below" "to"
[136] "from" "up" "down" "in" "out" "on" "off" "over" "under"
[145] "again" "further" "then" "once" "here" "there" "when" "where" "why"
[154] "how" "all" "any" "both" "each" "few" "more" "most" "other"
[163] "some" "such" "no" "nor" "not" "only" "own" "same" "so"
[172] "than" "too" "very" "will" "hon" "can" "s" "make" "need"
[181] "one" "want" "say" "take" "sure" "however" "may" "much" "must"
[190] "now" "go" "use" "two" "get" "said" "made" "also" "look"
[199] "put" "see" "give"
dfm_speeches <- tokens(speeches_corpus, remove_punct = TRUE,
remove_symbols = TRUE, remove_numbers = TRUE) %>%
tokens_remove(our_stopwords) %>%
tokens_wordstem() %>%
tokens_ngrams(n = c(1,2)) %>%
dfm() %>%
dfm_tolower() %>%
dfm_trim(min_termfreq = 5, min_docfreq = 0.0025, docfreq_type = "prop")
dfm_speeches
Document-feature matrix of: 6,057 documents, 3,267 features (98.45% sparse) and 5 docvars.
features
docs welcom friend statement report make clear procur link local govern
text1 2 1 1 1 1 1 2 1 4 2
text2 1 1 0 0 2 0 0 0 0 0
text3 0 0 0 0 0 0 0 0 4 4
text4 1 0 0 0 0 0 0 1 0 1
text6 0 3 0 1 0 0 0 0 4 5
text7 0 0 0 0 0 0 0 0 0 0
[ reached max_ndoc ... 6,051 more documents, reached max_nfeat ... 3,257 more features ]
Now, we can plot the length of each document again, now with the filtered, cleaned and preprocessed data, and see the changes we have done:
ntokens_speeches_2 <- ntoken(dfm_speeches)
data.frame(ntokens_speeches_2) %>% ggplot(aes(ntokens_speeches_2)) + geom_histogram() + xlab('Number of tokens')
To find out whether more meaningless high-frequency words remain in the documents, we can display the top \(n\) most frequent words. By brute force, we can keep appending such words to the stopword list and re-running the code until the set of most frequent words consists only of meaningful terms.
topfeatures(dfm_speeches, 50)
govern peopl member right friend work hous minist year support time bill
4064 3672 3307 3001 2711 2450 2354 2257 1999 1714 1674 1603
state point issu countri import debat secretari mani gentleman think way constitu
1557 1522 1518 1512 1481 1428 1427 1388 1359 1315 1292 1265
new know us local place chang come agre busi servic clear ensur
1236 1219 1215 1144 1134 1095 1092 1084 1069 1051 1026 1021
case committe number labour just part last home tax help power provid
987 966 965 954 945 940 933 892 879 870 866 866
question public
856 836
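The brute-force loop just described can be made less manual: list only the top features not yet covered by the stopword list, and review those candidates. A minimal sketch on a toy corpus (the toy sentences and the current_stopwords vector are invented for illustration):

```r
library(quanteda)

# Toy corpus standing in for the speeches
toy_dfm <- dfm(tokens(c("the house will rise today",
                        "the house must rise now")))
current_stopwords <- c("the", "must", "now")

# Top features not yet in the stopword list: candidates to review,
# and possibly append to the list before re-running the pipeline
candidates <- setdiff(names(topfeatures(toy_dfm, 10)), current_stopwords)
candidates
# contains "house", "rise", "will", "today"
```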
Having cleaned our data, with all the chosen stopwords removed, we can plot the resulting word cloud with the textplot_wordcloud() function and see what we obtain.
textplot_wordcloud(dfm_speeches, random_order = FALSE, rotation = 0.25,
                   color = RColorBrewer::brewer.pal(8, "Set2"),  # "Set2" has at most 8 colors
                   max_words = 100, max_size = 3)
Now, we will additionally plot a word cloud for bigrams only, in order to see the most frequent word pairs. Hence, we first create a document-feature matrix of bigrams, with the same preprocessing steps applied to our dataset.
bigrams <- tokens(speeches_corpus, remove_punct = TRUE,
remove_symbols = TRUE, remove_numbers = TRUE) %>%
tokens_remove(our_stopwords) %>%
tokens_wordstem() %>%
tokens_ngrams(n = c(2)) %>%
dfm() %>%
dfm_tolower() %>%
dfm_trim(min_termfreq = 5, min_docfreq = 0.0025, docfreq_type = "prop")
bigrams
Document-feature matrix of: 6,057 documents, 844 features (99.42% sparse) and 5 docvars.
features
docs make_clear local_govern valu_money increas_number govern_support support_work work_done local_peopl friend_make
text1 1 1 1 1 1 1 1 1 0
text2 0 0 0 0 0 0 0 0 1
text3 0 3 0 0 0 0 0 0 0
text4 0 0 0 0 0 0 0 0 0
text6 0 2 0 0 0 0 0 0 0
text7 0 0 0 0 0 0 0 0 0
features
docs valid_point
text1 0
text2 1
text3 0
text4 0
text6 0
text7 0
[ reached max_ndoc ... 6,051 more documents, reached max_nfeat ... 834 more features ]
Now, we can plot the number of bigrams in each document and compare it with the distribution over all tokens:
bigram_ntokens <- ntoken(bigrams)
data.frame(bigram_ntokens) %>% ggplot(aes(bigram_ntokens)) + geom_histogram() + xlab('Number of bigrams')
Finally, we plot the bigram word cloud, where we reduce the maximum number of bigrams plotted, since the chart above shows that the X-axis (number of bigrams) reaches a maximum of around 40.
textplot_wordcloud(bigrams, random_order = FALSE, rotation = 0.25,
                   color = RColorBrewer::brewer.pal(8, "Set2"),  # "Set2" has at most 8 colors
                   max_words = 40, max_size = 3)
Next, we will fit a structural topic model on this corpus, using the gender variable as the topical-prevalence covariate and setting \(K = 20\) as the number of topics. The stm() function returns \(K\) topics, each characterized by the words most strongly associated with it, as we can see in the graph below:
stm_speeches <- stm(documents=dfm_speeches, prevalence = ~gender,
K = 20, seed=123)
plot(x=stm_speeches, type = 'summary', n=5)
To further analyze each topic and the words it contains, we can show the four word weightings used to assign words to each topic: highest probability (the most probable words), FREX (words both frequent and exclusive to the topic), lift and score (weightings borrowed from two other popular text-mining packages).
labelTopics(stm_speeches)
Topic 1 Top Words:
Highest Prob: bill, amend, claus, govern, new, hous, committe
FREX: amend, claus, new_claus, lord, bill, provis, tabl
Lift: lord_amend, govern_amend, subsect, amend, new_claus, tabl_amend, claus_amend
Score: lord_amend, amend, bill, claus, new_claus, amendment, lord
Topic 2 Top Words:
Highest Prob: legisl, data, govern, power, concern, act, protect
FREX: data, cycl, safeti, communic, legisl, mail, enforc
Lift: call_govern, communic_data, royal_mail, intercept, mail, cycl, gambl
Score: call_govern, legisl, data, cycl, mail, royal_mail, communic
Topic 3 Top Words:
Highest Prob: year, last, week, month, past, ago, three
FREX: ago, cancer, year_ago, last_week, four, women, church
Lift: three_four, first_world, church, four_half, cancer, commemor, three_year
Score: three_four, year, cancer, women, last, ago, year_ago
Topic 4 Top Words:
Highest Prob: question, minist, gentleman, ask, hous, whether, way
FREX: question, answer, ask, ladi, statement, order, written
Lift: explan, grate_minist, give_way, ask_question, soon_possibl, grate_ladi, urgent_question
Score: explan, question, answer, point_order, gentleman, ask, minist
Topic 5 Top Words:
Highest Prob: local, care, author, servic, fund, council, communiti
FREX: local, local_author, rural, author, council, fund, care
Lift: support_local, depart_environ, local_communiti, care_system, environ_food, food_rural, rural_affair
Score: local, support_local, local_author, author, rural, health, council
Topic 6 Top Words:
Highest Prob: constitu, invest, london, friend, plan, citi, new
FREX: rail, road, london, citi, infrastructur, invest, transport
Lift: factori, main_line, railway_line, high-spe, railway, high_speed, south-west
Score: factori, invest, rail, citi, london, railway, station
Topic 7 Top Words:
Highest Prob: member, debat, friend_member, friend, hous, mr, right
FREX: friend_member, member, right_member, member_parliament, mr, member_north, north
Lift: speech_member, member_gainsborough, edward_leigh, gainsborough_sir, sir_edward, north_mr, member_north
Score: member, speech_member, friend_member, mr, debat, north, right_member
Topic 8 Top Words:
Highest Prob: work, peopl, job, young, employ, young_peopl, wage
FREX: minimum_wage, minimum, wage, young_peopl, apprenticeship, young, employ
Lift: work_experi, low_pay, work_programm, nation_minimum, minimum_wage, support_work, jobseek_allow
Score: work_experi, wage, young_peopl, young, work, employ, apprenticeship
Topic 9 Top Words:
Highest Prob: busi, small, compani, bank, govern, insur, small_busi
FREX: small_busi, busi, insur, pub, small, busi_rate, corpor
Lift: small_busi, busi_rate, medium-s, small_medium-s, high_street, depart_busi, busi_innovat
Score: busi_rate, busi, small_busi, small, compani, bank, pub
Topic 10 Top Words:
Highest Prob: govern, labour, sinc, food, poverti, year, polici
FREX: poverti, labour_govern, credit, food, labour, food_bank, child_care
Lift: child_poverti, child_care, food_bank, poverti, univers_credit, live_crisi, inequ
Score: child_poverti, labour, poverti, food_bank, child_care, food, univers_credit
Topic 11 Top Words:
Highest Prob: school, children, state, secretari_state, secretari, educ, teacher
FREX: school, teacher, children, educ, free_school, educat, parent
Lift: head_teacher, meal, primari_school, free_school, teacher, pupil, school
Score: meal, school, teacher, children, free_school, educ, parent
Topic 12 Top Words:
Highest Prob: court, justic, right, prison, law, crimin, european
FREX: prison, justic, court, sentenc, arrest_warrant, offend, ukrain
Lift: legal_aid, arrest_warrant, european_arrest, magistr, judici_review, justic_system, offend
Score: legal_aid, court, justic, prison, russia, ukrain, crimin
Topic 13 Top Words:
Highest Prob: tax, money, govern, budget, spend, rate, million
FREX: tax, budget, spend, incom_tax, money, incom, borrow
Lift: council_tax, 50p, exchequer, incom_tax, annuiti, chancellor_exchequer, help_buy
Score: council_tax, tax, incom_tax, budget, incom, rate, money
Topic 14 Top Words:
Highest Prob: hous, peopl, benefit, home, sector, rent, privat
FREX: rent, landlord, disabl, mental, tenant, mental_health, privat_rent
Lift: disabl_live, live_allow, privat_rent, rent_sector, capabl_assess, work_capabl, landlord
Score: disabl_live, rent, disabl, tenant, landlord, privat_rent, rent_sector
Topic 15 Top Words:
Highest Prob: parti, wale, vote, elect, scotland, power, northern
FREX: scotland, ireland, northern_ireland, scottish, wale, northern, devolut
Lift: english_vote, peopl_scotland, vote_english, scottish_govern, scottish_referendum, english_law, scottish
Score: vote_english, scotland, northern_ireland, wale, devolut, ireland, scottish
Topic 16 Top Words:
Highest Prob: home, polic, forc, secretari, home_secretari, individu, crime
FREX: polic, home_secretari, passport, arm, investig, forc, arm_forc
Lift: polic_forc, rape, metropolitan_polic, polic_servic, passport, shadow_home, polic_offic
Score: polic_forc, polic, home_secretari, crime, passport, arm_forc, terror
Topic 17 Top Words:
Highest Prob: countri, minist, govern, state, right, prime, prime_minist
FREX: palestinian, israel, iraq, israeli, peac, european_union, isil
Lift: iran, nick, geneva, sunni, hama, assad, diplomat
Score: nick, palestinian, iraq, isil, israel, syria, israeli
Topic 18 Top Words:
Highest Prob: peopl, know, minist, gentleman, think, nhs, right
FREX: speaker, deputi, nhs, deputi_speaker, hospit, trust, mr_speaker
Lift: shout, madam, madam_deputi, health_secretari, deputi_speaker, deputi_prime, deputi
Score: shout, nhs, speaker, hospit, patient, deputi, prime_minist
Topic 19 Top Words:
Highest Prob: friend, right, point, committe, issu, rais, import
FREX: select_committe, select, import_point, rais, friend_make, clerk, rais_issu
Lift: philip_davi, member_shipley, shipley, philip, shipley_philip, right_rais, gentleman_rais
Score: philip_davi, committe, friend, select_committe, select, rais, right_friend
Topic 20 Top Words:
Highest Prob: energi, market, price, compani, consum, govern, right
FREX: energi, climat, climat_chang, price, gas, market, consum
Lift: ofgem, big_six, energi_price, energi_compani, climat_chang, carbon, electr
Score: big_six, energi, consum, price, market, gas, compani
To find the top three documents associated with each topic, we will use the plotQuote() function, which produces an image of the three original documents most closely associated with each topic. We could also use the findThoughts() function (which we have kept in the code), but it displays the top documents for every topic together, which makes for a hard-to-read display. Hence, we use plotQuote() directly and display the results topic by topic:
# findThoughts() expects a character vector of texts, one per document
top3_docs <- findThoughts(model = stm_speeches, texts = as.character(speeches_corpus), n = 3)
for (i in 1:20) {
  plotQuote(speeches_corpus[order(stm_speeches$theta[, i], decreasing = TRUE)[1:3]],
            width = 150, main = paste('Topic', i))
}
As we can see, if we check each topic against its most characteristic words, the top documents correspond well with their associated topics.
Finally, we can check for which topics women are, on average, more active than men. To do so, we plot the results of estimateEffect(). Among the many parameters we can set, the most important is the plotting method; we use "difference", since it directly displays the male-female comparison for each topic. With the first covariate value set to 'male' and the second to 'female', and taking 0 as the reference point, each topic's estimate is shifted to the right if men are more active on that topic and to the left if women are. We can cross-check this against the sign of the estimate in the summary of the estimateEffect() output: since it reports the effect for men (gendermale), the estimate is positive when the line shifts right and negative when it shifts left. Let's show the results so we can verify this.
gender_effect <- estimateEffect(formula = ~gender, stmobj = stm_speeches,
                                metadata = docvars(dfm_speeches))
summary(gender_effect)
Call:
estimateEffect(formula = ~gender, stmobj = stm_speeches, metadata = docvars(dfm_speeches))
Topic 1:
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.0476095 0.0040813 11.665 <2e-16 ***
gendermale -0.0001506 0.0045783 -0.033 0.974
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Topic 2:
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.035701 0.003289 10.855 <2e-16 ***
gendermale 0.003040 0.003825 0.795 0.427
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Topic 3:
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.0454110 0.0027941 16.25 <2e-16 ***
gendermale 0.0007707 0.0032169 0.24 0.811
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Topic 4:
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.0744100 0.0038080 19.540 <2e-16 ***
gendermale -0.0005825 0.0042270 -0.138 0.89
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Topic 5:
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.052432 0.003586 14.62 <2e-16 ***
gendermale 0.004711 0.004061 1.16 0.246
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Topic 6:
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.053743 0.003928 13.683 <2e-16 ***
gendermale 0.002761 0.004392 0.629 0.53
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Topic 7:
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.060877 0.003686 16.515 <2e-16 ***
gendermale 0.002595 0.004460 0.582 0.561
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Topic 8:
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.050769 0.003659 13.876 <2e-16 ***
gendermale -0.003078 0.004100 -0.751 0.453
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Topic 9:
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.039757 0.003030 13.123 <2e-16 ***
gendermale -0.006531 0.003432 -1.903 0.057 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Topic 10:
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.0484900 0.0037022 13.098 <2e-16 ***
gendermale -0.0007185 0.0044012 -0.163 0.87
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Topic 11:
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.037905 0.003634 10.43 <2e-16 ***
gendermale -0.001265 0.004087 -0.31 0.757
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Topic 12:
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.041880 0.003492 11.991 <2e-16 ***
gendermale -0.002189 0.003890 -0.563 0.574
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Topic 13:
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.039330 0.003202 12.284 <2e-16 ***
gendermale 0.003013 0.003785 0.796 0.426
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Topic 14:
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.0412647 0.0034208 12.06 <2e-16 ***
gendermale -0.0001557 0.0038750 -0.04 0.968
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Topic 15:
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.049990 0.004193 11.921 <2e-16 ***
gendermale 0.001092 0.004531 0.241 0.81
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Topic 16:
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.054442 0.003772 14.434 <2e-16 ***
gendermale 0.001196 0.004416 0.271 0.787
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Topic 17:
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.0697661 0.0045979 15.173 <2e-16 ***
gendermale -0.0005429 0.0054805 -0.099 0.921
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Topic 18:
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.066942 0.003894 17.19 <2e-16 ***
gendermale -0.006633 0.004364 -1.52 0.129
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Topic 19:
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.062449 0.003398 18.379 <2e-16 ***
gendermale -0.002024 0.003854 -0.525 0.599
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Topic 20:
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.026940 0.002947 9.142 <2e-16 ***
gendermale 0.004609 0.003229 1.427 0.154
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
plot.estimateEffect(x = gender_effect, covariate = 'gender', model = stm_speeches,
                    method = 'difference', cov.value1 = 'male', cov.value2 = 'female',
                    labeltype = 'custom', custom.labels = paste0(1:20, ': M vs W'),
                    xlab = 'Proportion', ylab = 'Topic',
                    main = 'Proportion of topics per gender', n = 3, width = 100)
As we can see, for the topics in which women are more active than men, the estimates are indeed negative, while the topics in which men are more active have positive estimates.